SSIS 2008 SCD Performance Issues
Hi, we have 20 dimension tables, and each table will have around 20 million records. These tables are loaded daily from 5 files of 3 million records each. We are currently using the SCD transformation for the Type 2 load (to maintain history in the dimension table). But the SCD is taking a long time to insert the data; below are the statistics I recorded when I executed the package with sample files:

Run 1: File 1 (0.5 million records) - 2 minutes (dimension table is empty)
Run 2: File 2 (0.5 million records) - 13 minutes (table has 589,000 records)
Run 3: File 3 (0.5 million records) - 26 minutes (table has 1,140,000 records)
Run 4: File 4 (0.5 million records) - 37 minutes (table has 1,680,000 records)
Run 5: File 5 (1 million records) - 51 minutes (table has 2,780,000 records)

Package elapsed time: 2 hr 9 min

1. How do I improve the performance of the SCD? If that is not possible, is there any way of loading a table from a file in parallel so that I can achieve better performance?
2. In Informatica, we have a partitioning feature that loads data in parallel, which greatly improves performance. Is there any equivalent feature or workaround in SSIS?

Any help would be greatly appreciated.

Thanks,
BhaskarDas
May 19th, 2009 7:12pm

This task does a row-by-row evaluation. The thread below covers the same question, answered by Matthew Roche, and provides insight into the issue: http://social.msdn.microsoft.com/Forums/en-US/sqlintegrationservices/thread/d3aca933-7473-49fc-8926-fd96305e8729/

Hope this helps,
David Dye
May 19th, 2009 7:16pm

There is an alternative component available on CodePlex, the Kimball SCD component. I must admit I'm biased - I wrote it and am coordinator of the project. It outperforms the stock SCD component because it doesn't do RBAR lookups - but it does this at the expense of requiring that you provide a flow into the component with the entire existing dimension table. In my situation (which may not be typical - I can't say) I experienced significant performance increases at what I think are remarkably low costs. For example, a data set that took 29 minutes to process with the Wizard took 15 seconds with this component - at the cost of 400 MB RAM versus 40 MB. This is starting to sound like a commercial (sorry).

Anyway - I suggest you look that component over, as well as reviewing the thread that's linked in the FAQ with a title something like "updating rows that exist and inserting ones that don't". It discusses rolling your own SCD.

Now time to put my "ad" in that thread David referenced... :)
May 19th, 2009 7:36pm

Thanks Todd and David. I have already seen the Kimball SCD component; it seems we have to install this component, but I will not be allowed to implement such components in my projects. Could you please suggest any other way?

BhaskarDas
May 19th, 2009 7:49pm

Your only other option is to roll your own using Conditional Splits, Derived Columns, Lookups, ... And for what reason would you not be able to use a third-party component?
May 19th, 2009 8:24pm

Another alternative is to populate the dimension in two steps using available transformations and components. You use a Lookup to decide whether the row has to be inserted or updated; the rows to be updated are inserted into a working table in the DW database. Then you use an Execute SQL Task to do the updates of the existing rows in the dimension by joining to the working table. This same approach could be modified slightly to handle SCD2 changes. I haven't used Todd's component, but I know he has put a lot of work into it and it is definitely a valid option.

Rafael Salas | Don't forget to mark the post(s) that answered your question http://rafael-salas.blogspot.com/
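As a rough sketch of that Execute SQL Task step for SCD2 - every table and column name below (DimCustomer, stg_CustomerUpdates, CustomerBK, IsCurrent, EndDate) is a made-up placeholder, so adapt it to your own schema:

```sql
-- Sketch only: all object and column names are placeholders.
-- Step 1 (SCD2 expire): close out the current version of each row the
-- Lookup flagged as changed, by joining the dimension to the working
-- table on the business key.
UPDATE d
SET    d.IsCurrent = 0,
       d.EndDate   = GETDATE()
FROM   dbo.DimCustomer AS d
JOIN   dbo.stg_CustomerUpdates AS s
       ON s.CustomerBK = d.CustomerBK
WHERE  d.IsCurrent = 1;

-- Step 2: insert the new versions that were staged in the working table.
INSERT INTO dbo.DimCustomer
       (CustomerBK, CustomerName, EffectiveDate, EndDate, IsCurrent)
SELECT  s.CustomerBK, s.CustomerName, GETDATE(), NULL, 1
FROM    dbo.stg_CustomerUpdates AS s;
```

Because both statements are set-based, they run once per load instead of once per row, which is exactly where the OLE DB Command approach loses its time.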
May 19th, 2009 8:58pm

Yeah, if you are performing lots of updates then you do not want to use the SCD output that includes the OLE DB Command, since that runs on a row-by-row basis. Look at Andy's article on using a set-based update approach instead - Anatomy of an Incremental Load (http://www.sqlservercentral.com/articles/SSIS/62063/). Also, if you are using SSIS 2008 you can use the Cache Transform for your lookups. You could definitely break this process out in the Data Flow Task to load these files at the same time if needed. If you are testing, though, watch your file space for the temp directories on the Data Flow Task. Make sure you set the MaxConcurrentExecutables and EngineThreads properties appropriately too.
_____________________________________________________
Please mark posts as answer or helpful when they are.
May 20th, 2009 5:03am

Just to reiterate what the other guys have said... the SCD component and the nature of the components that the SCD Wizard produces are slow. This is mainly because of the OLE DB Command (as someone suggested above), so get rid of that by instead inserting the data into a staging table and using a T-SQL UPDATE statement to do the update. The SCD component itself is also very slow - although given the size of your destination data set, you may need to experiment to see if a full-cache Lookup component performs any quicker.

Regards,
Jamie
http://jamiethomson.spaces.live.com/ | http://blogs.conchango.com/jamiethomson | @jamiet
May 20th, 2009 11:27am

Hi, just an idea... I find it strange that execution time explodes each time you load a new file... Perhaps your job does some checking while loading data into the table... And perhaps statistics on database objects (indexes) are not recomputed between each file (auto-update statistics disabled?)... so the execution plan is based on an empty table from beginning to end... Or perhaps a lack of indexes can explain that... What happens if you perform an UPDATE STATISTICS between each load?

Fabrice
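If you want to try that between file loads, the statement is a one-liner (the dimension table name here is a hypothetical placeholder):

```sql
-- Recompute statistics on the dimension after each file load so the
-- optimizer stops planning against what it thinks is an empty table.
-- dbo.DimCustomer is a made-up name; substitute your own table.
UPDATE STATISTICS dbo.DimCustomer WITH FULLSCAN;
```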
May 20th, 2009 11:36am

Thanks Jamie and Fabrice. Actually, I have two staging tables: one for column-level validations (stage 1) and one for lookup validations (stage 2). I have base and child table functionality, currently one base table and 20 child tables. In the base table SCD, when a change happens to a variant column for a set of business keys, all the child tables are updated, as we have enabled the ON UPDATE CASCADE option for all the child tables. (Currently the child table SCD is just inserting rows.) The 5 files I have tested with are not causing any updates, so as I mentioned above, all records are inserted. My doubt is that even for just inserting, the SCD takes this long. I am worried that in production, if the dimension grows into the billions, the package may run for 6 hours or more. I am not even able to convert the SCD into a Lookup and Conditional Split, since a lot of changes would be required in my package (not allowed).

Questions:
1. How can I gain performance with the cached Lookup transformation when it should also perform a row-by-row operation?
2. Can you please suggest an approach that requires minimal changes to my package?
3. Are there any settings to be changed on the SCD?
4. Is there any way of loading the same table in parallel streams, which I feel might improve performance?

I will not be allowed to use third-party SCD components. Please suggest a good approach that would greatly help me improve package performance.

BhaskarDas
May 20th, 2009 9:20pm

Actually, I am planning to build an index on the business keys and test the performance. Please leave a comment on this as well.

Das
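An index like the following is the usual shape for this - the names (dbo.DimCustomer, CustomerBK, IsCurrent) are placeholders for your own schema:

```sql
-- Sketch with placeholder names. The SCD transformation probes the
-- dimension once per incoming row on the business key, so an index
-- keyed on that column turns each probe into a seek instead of a scan
-- of an ever-growing table (which matches the run times above:
-- 2, 13, 26, 37, 51 minutes as the table grows).
CREATE NONCLUSTERED INDEX IX_DimCustomer_BusinessKey
ON dbo.DimCustomer (CustomerBK)
INCLUDE (IsCurrent);
```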
May 20th, 2009 9:22pm

A parallel design could improve your performance - but could also introduce some race conditions on identical business keys unless you're careful.

You can divide up your flow into smaller sets, and then process those separate sets in parallel. In order to divide up your flow, you would have to use a Script component (because you can't use a pre-made Row Number component) to add a column to your flow with a "division number", then use a Conditional Split component to split your flow according to that division number. Then route each flow into a separate SCD Wizard component (ugh).

Unfortunately, you're multiplying some of the "bad" effects of the SCD Wizard at the same time - specifically the problems with editing it.
May 20th, 2009 10:50pm

Have you considered using T-SQL MERGE? This is a set-based operation as opposed to row by row: http://technet.microsoft.com/en-us/library/bb510625.aspx I am not familiar with the details of the issue at hand, so I am not sure this is a good alternative, but I figured I would mention it.

Luis
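For a Type 2 dimension, the commonly published pattern is a MERGE wrapped in an INSERT ... SELECT over its OUTPUT clause, so one pass expires the changed rows and a second pass inserts their new versions. A rough sketch - every table and column name below is a made-up placeholder:

```sql
-- Sketch only: dbo.DimCustomer, dbo.stg_Customer and all columns are
-- hypothetical. The inner MERGE inserts brand-new business keys and
-- expires the current row when an attribute changed; its OUTPUT feeds
-- the outer INSERT, which adds the new version of each changed row.
INSERT INTO dbo.DimCustomer
       (CustomerBK, CustomerName, EffectiveDate, EndDate, IsCurrent)
SELECT  CustomerBK, CustomerName, GETDATE(), NULL, 1
FROM (
    MERGE dbo.DimCustomer AS tgt
    USING dbo.stg_Customer AS src
        ON  tgt.CustomerBK = src.CustomerBK
        AND tgt.IsCurrent  = 1
    WHEN NOT MATCHED BY TARGET THEN            -- brand-new business key
        INSERT (CustomerBK, CustomerName, EffectiveDate, EndDate, IsCurrent)
        VALUES (src.CustomerBK, src.CustomerName, GETDATE(), NULL, 1)
    WHEN MATCHED AND tgt.CustomerName <> src.CustomerName THEN
        UPDATE SET tgt.IsCurrent = 0,          -- expire the old version
                   tgt.EndDate   = GETDATE()
    OUTPUT $action AS MergeAction, src.CustomerBK, src.CustomerName
) AS expired
WHERE expired.MergeAction = 'UPDATE';          -- re-insert only changed rows
```

MERGE requires SQL Server 2008 or later, which lines up with the SSIS 2008 setup described in the question.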
November 15th, 2010 2:56pm
